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1 Relationship between relative expression levels, read generating 
probabilities and expected read coverage 

For simplicity, we assume that RNA-Seq reads are sequenced uniformly across the transcriptome. 
In addition, we only consider hxed length single-end RNA-Seq reads. We denote the read length as 
L. Given a set of Mt transcripts, we denote the relative expression levels as r = (ti, T2, ■ • • , tm t ), 
the read generating probabilities as 6 = [6q,6\,... ,6m t ) an d the expected read coverage as E = 
(£o> £i> £2, ■ • • j £m t )- We further denote the lengths of transcripts as L = (Iq, h,. . . , Im t ) and let 
Iq = L. Here transcript 0 refers to a non-existing "noise" transcript. It represents all reads that are 
generated from the background noise. 



According to [3] , 



= (1 - #0) • 5T7 , I > 0. 



Then the expected read coverage of a transcript is defined as 

& = rrv jaa 

It is easy to see that £j oc tj for j > 0. 

Because the expected read coverage of a contig is defined as the expected read coverage of its parent 
transcript, the expected read coverage of the ith contig is 

where t(i) is contig i's parent transcript's index number and we define t(0) = 0. 
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2 The RSEM-EVAL model as an approximation of the "natural" 
model 

2.1 The "natural" model 

As described in the main text, Figure SI (a) shows a natural way of generating both the RNA-Seq 
data set and the assembly. We will refer to it as the "natural" model. In the "natural" model, we 
first generate the number of transcripts, Mt- Then a set of transcript sequences, T, and their relative 
expressions, r, are generated. Given T and r, we first generate the read generating probabilities, 
O and then use the RSEM model described in [3, 2] to generate a single-end RNA-Seq read data 
set. In the end, the "true" assembly A with overlap length w = 0 is constructed from the transcript 
sequences T with the help of hidden information that specifies the origin of each read (generated 
by the RSEM model). 

For simplicity of presentation, we describe and depict in Figure SI (a) the basic RSEM model, which 
was introduced in [3]. In practice, RSEM-EVAL uses a fuller extended model as described in [2]. 
In the basic RSEM model, given the probabilities, 0, of a read being generated from each of the 
possible transcripts, N reads are generated. Then for each read n, the transcript from which it is 
derived, G n , its start position on that transcript, S n , and its orientation, O n , are generated. Finally, 
the read sequence, R n , is generated based on its true alignment and a sequencing error model. In 
the RSEM model, only R n is observed. G n , S n and O n are all hidden variables. We denote all of 
these hidden variables by H = (G, S, O). 

The joint probability of an assembly and the RNA-Seq data in the "natural" model can be expressed 
as 



P{A, D) = P(M T ) Y P(T\M T )P(T\M T )P(e\T,T)J2P(H\Q)P(D\H,T)P(A\H,T). 



2.2 Derivation of the RSEM-EVAL model from the "natural" model 

Unfortunately, directly calculating P(A, D) under the "natural" model is computationally infeasible 
because it requires us to sum over all possible transcript sets. Therefore, we use some approximations 
of the "natural" model so that we can calculate P(A, D) more efficiently. We can rewrite the 
probability P(A, D) as follows: 



In the "natural" model, P(A\A) is hard to compute because the contigs are not conditionally 
independent given A. Similarly, P(D\A, A) is also hard to compute. Therefore we make the following 
two approximations so that we can compute P(A\A) and P(D\A,A) efficiently: 



T,r,e 



H 
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1. We assume that given the expected read coverage, the contigs are generated independently, 
i.e., 

M 

P(A\A) = l[P(A i \X i ). 

i=i 

2. To calculate P(D\A, A), we first ignore the dependency between RNA-Seq reads and treat the 
assembly as the true transcript set. That is, we calculate the RSEM likelihood: 

JV 

Prsem{D\T = A, 9) = Y[PRSEM(Ri\T = A, 6). 

i=i 

We then correct Prsem{D\T = A,Q) by a term that takes into account the dependencies 
between the reads to obtain our approximation of P(D\A,A). We will see later that this 
approximation is principled. 

With these two approximations, we obtain the RSEM-EVAL model shown in Figure Sl(b). There- 
fore, the RSEM-EVAL model can be viewed as an approximation to the "natural" model. 

3 Derivation and calculation of the contig length distribution 

Our goal is to define and calculate P(£\\) such that it closely matches the contig length distribution 
implied by the "natural" model. 

3.1 A procedure to generate contig lengths that is closely related to the "nat- 
ural" model 

First, let us consider the following procedure (Procedure 1) for generating contig lengths with a 
given expected read coverage A (We assume that the overlap length w is already given). This 
procedure is closely related to the "natural" model we mentioned before. We use this procedure to 
help us define the contig length distribution. 

Procedure 1 Generation of contig lengths given a A. 

1) Sample one transcript length t from the transcript length distribution, P(t). 

2) Sample the number of reads generated from each of the t — L + 1 valid positions in the transcript 
based on Poisson distributions parameterized with A. 

3) Merge any two reads that overlap at least w bases to construct "true" contigs. 
return The contig lengths obtained from step 3). 



Procedure 1 defines a conditional distribution over the tuple of transcript length, contig start posi- 
tion and contig length, P(t,pos, £\X). It represents the conditional probability of existing a contig 
of length £, which starts at position pos (ranges from 0 to t — L) of a transcript with length t, given 
the expected read coverage A. We can decompose P(t, pos, £\X) as 
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P(t,pos,e\X) = P{t\X)P(pos\t, X)P(£\pos,t, A). 



(1) 



We will discuss each term in the right hand side of (1) separately. 
P(t\X) We assume that 

P(t\X) = P(t), (2) 
and t, the transcript length follows a negative binomial distribution, 

t~NB(r,p), P{t = k)=^ k+ r r _~ l ^p r {l-p)\ fc = 0,l,... 

To justify our use of the negative binomial distribution here, we fit the empirical transcript length 
distribution obtained from real mouse Ensembl annotation with Poisson, geometric and negative 
binomial distributions. We found that the negative binomial distribution fits the empirical transcript 
length distribution the best (likelihood-ratio test, p ps 0). 



P(pos\t, X) Because the number of reads generated from a position in a transcript approximately 
follows a Poisson distribution with parameter A, we use iid Poisson distributions to approximate 
the read generating process from a transcript. Furthermore, we do not distinguish between reads 
coming from the forward or reverse strands and let a read's start position be its leftmost position 
in the forward strand. We denote 

Px = e- X , 

which is the probability of generating no reads from a position in the transcript under the Poisson 
assumptions. 

In order to have a contig start at position pos, we need to make sure that 

• There are no reads generated at the min(L — w,pos) positions before pos. Otherwise, these 
reads will merge with the segment at pos to form a contig starting before pos. The probability 

r , , . , . min(L— w,pos) 

of this event is Px 

• There is at least one read generated at position pos. The probability of this event is 1 — p\. 
Therefore we have 

P{pos\tA)=pT {L - W ' P ° S \l-Px). (3) 

P{£\pos, t, A) Because the transcript must contain the contig, we must have pos + £ < t. That 
means if £ > t — pos, then P(£\pos, t, X) = 0. Now let us assume that £ < t — pos. In order to have 
a contig of length £ start at pos, we need to make sure that 

• The reads starting at pos can be extended to a segment of length £. This implies that there 
must be at least one read at position pos + £ — L. 
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• No reads are generated at the min(L — w,t — pos — £) positions after position pos + £ — L. 
Otherwise, the generated reads would merge with the segment to form a contig with a longer 
length. The probability of this event is p™ m ( i_u, ' t -P os ~ £ )_ 

We denote the function f\(l) as the probability of extending a read to a segment of length £ by 
merging reads to its right. f\(£) can be calculated using the following recurrence: 



, £ < L 

h(£) = { 1 , £ = L . (4) 




{L-j))p L - j -\l-p x ) , £>L 



When £ > L, we need at least one read to cover the last L bases, which explains 1 — p\. Because we 
need at least w bases of overlap to merge a segment and the read(s) at its end, any segment that 
overlaps the read(s) at its end by w to L — 1 bases should be considered. For each possible overlap 
length j, f\(£ — (L — j)) denotes the probability of having a segment that overlaps j bases with the 
read(s) at its end. p^ ■* 1 guarantees that the overlap length will not exceed j. 

P(£\pos, t, A) can be expressed in terms of f\(£): 

\ 0 , I > t - pos 

P(£\ P OS,t,X) = | A( ^min(W-p OS -*) £ < t _ pQS ■ ( 5 ) 

Putting things together by plugging (2), (3) and (5) into (1), we can express P(t,pos,£\X) as 

Jo , £ > t - pos 

P(t, P OS,£\X) = j p(t)AW(1 _^ ) ^in(L-^,p OS ) + min(W- pOS -^) g < ^ _ ^ • (<>) 

3.2 Defining the contig length distribution P(£\\) 

Given P(t, pos, £\X) defined in the last section, it is natural to define P(£\X) as 

P(£\X) = Y,P(t,pos,£\X). (7) 

t,pos 

However, the events (t,pos,£) are not mutually exclusive because one pass of Procedure 1 can 
produce multiple contigs. Therefore, the definition in (7) will result in an invalid distribution for 
that J2eP(^\^) > 1- Thus we have to define P(£\X) by an alternative procedure (Procedure 2): 

In RSEM-EVAL, we define P(£\X) as the limit distribution of Procedure 2 when N — > oo. By 
defining 

Ep/ ,\ ^t—£ min(L—w,pos)+min(L~w,t—£—pos) 
_ t f_W Z^pos=oP\ ,o\ 

CX[ - > ~ ^ p( nsr t'-L mm(L-w, P os') ' ^ > 

Z^t' r[ . z ) l^pos'=oP\ 

P(£\X) can be expressed as 

P{£\X) = c x (£).f x (£). (9) 
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Procedure 2 Define P(l\\). 



Initialize an empty bag. 
repeat 

1) Run one pass of Procedure 1. 

2) Put all contigs produced from 1) into the bag. 
until The number of iterations, N, is large enough 

Define P(£\X) as the frequency of contigs with length t in the bag. 



If we set N = Mt in Procedure 2 and let all transcripts' expected read coverage be A, P(£\X) defined 
above is roughly equivalent to the probability of a randomly picked contig having length i in an 
instance of the "natural" model. Below we provide a proof for the correctness of (9). 

Proof: We define Xt tPOS / as the indicator variable that there is a contig generated from position 
pos of a length t transcript with length I in one iteration of Procedure 2. Then E{X t)POS ^) is 

for all I < t — pos. We also denote X ttPOS = Yle Xt,pos,l be the indicator variable that there is a 
contig generated from position pos of a length t transcript. It's easy to see that 

E(X t , pos ) = P(t)p^ L - w ' pos \l- Px ). 

Let Xi be the number of contigs with length t in one iteration and Xf 0 t be the total number of 
contigs in one iteration. We have 

Xe = X t:P os,e, 

t,pos 

Xtot = ^2 X t ,pos- 
t,pos 

Therefore by the property of expectation, we have 

t-e 



E(x e ) = j2 E ( x t, P os,e) = h(m-px)J2 p ( t )J2px 

t,pos t pos=0 

E(X tot ) = E(Xt, P os) = (1 - Pa) E P V E pf< L ~^ 

t,pos t pos=0 

Now we define the sample mean of Xi and X tot as 

1 N 

i=i 
1 N 

Xtot = — E^tot' 



min(L— ui,pos)+min(L— w,t— I— pos) 



i=l 
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U) (i) 

where X\ and X].J t are the corresponding indicator variables in the zth iteration of Procedure 2. 

By the law of large numbers, when TV — > oo, the sample mean converges in probability to the 
expectation. Thus, we have 

t-e 

Xe A E(X e ) = f x (£)(l- Px )J2P(t) E pf m(L - w ^ os)+mi ^ L - w ^ pos \ 

t pos=0 

x tot A E(x tot ) = (i- Px )J2P(t)Epf n{L ~ w ' pos) - 

t pos=0 

Then according to Procedure 2, 

= (11) 

f „ \ pwrf"' min(L-tu,pos)+min(L-«;,t-^-pos) 

^ (i-PA)E,^OE*' 0 -/ = oPr (L ^ pos,) ' 

AW, (13) 



£t p (*) Ep OS =oK 



P(+/\ X^t'-L min(L— u),pos') 



= c A (£) • f x (£). (14) 
where (12) follows from (11) by Slutsky's theorem. ■ 

3.3 Practical considerations 

In practice, transcript lengths are not likely to be available. However, we can estimate the transcript 
length distribution from a related species with a known transcript set. Because we do not estimate 
the distribution directly from the species whose transcriptome is sequenced, we want to make 
sure that the RSEM-EVAL score is not sensitive to the estimated transcript length distribution. 
Therefore, we conducted the following experiment. 

Note that in (9), only c\(i) involves the transcript length distribution. Thus, we experimented by 
removing ca^(^)s from the RSEM-EVAL scores for the assemblies on the simulated mouse data set. 
We then calculated Spearman's rank correlation coefficients between the modified RSEM-EVAL 
scores and reference-based measures (Table SI). The results suggest that c\ i (£i) has little impact 
on the RSEM-EVAL score. Thus, the RSEM-EVAL score should not be sensitive to the estimated 
transcript length distribution. 

For users' convenience, RSEM-EVAL provides a script (estimateParams) to estimate the negative 
binomial parameters from a given transcript set. 
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4 Calculation of the likelihood term 



As noted in the main text, the likelihood term can be written as 

P /nu a ^ - P RSEm{D\T = A, Q C MLE ) 

P(D\A, A MLE ) - PRSEM{c = MT = AQCMLE y (15) 
where @ C MLE is the contig- level read generating probabilities converted from Amle by 

q C _ ^MLE,i(^i — L + l) 

T,jLo ^MLE,j (£j- L+l) 

We approximate the likelihood correction term (denominator) by the following procedure: 

M 

Prsem(C = l\T = A, Q C MLE ) « 11 P{d = l\a u A'J, 

i=l 

p(c i = i\a i ,>! i )^(i-p x r)f K (e i ), 

N91 



A- 



MLEA 



t\ — L + l 

Here we decompose the probability of covering the assembly into the products of probabilities of 
covering each contig. The probability of covering a contig is further approximated using the Poisson 
assumptions we used in calculating the contig length distribution. In order to cover a contig, we need 
first generate reads from the leftmost position of the contig and then extend these read(s) to the 
end. The py and fy_(li) are defined the same as before, except that we replace the transcript-level 
expected read coverage, A, with the contig-level expected read coverage, A'. 



4.1 Justification of RSEM-EVAL's likelihood correction term 

A slightly modified "natural" model. Our justification is based on a slightly modified "nat- 
ural" model: given the true transcripts and their expression levels, we first generate the number of 
reads starting from each position on each transcript. Let n'^ be the number of reads generated at the 
jth position (forward strand) of transcript i. We assume that the n^s are generated independently 
and that 

n'^ ~ Poisson(£i), 

where £j is the expected read coverage of the ith transcript. We denote by T = { n ij} the set of the 
number of reads generated from each position. Note that n'^ indicates location information but not 
order information (e.g., which reads are generated at the jth position of transcript i). Thus, the 
next step is to recover the read generating order information. The probability of any possible read 
generating order is 

Ylij n ij ' 

which is the inverse of the multinomial coefficient. Thus far, we know each read's original transcript 
and location in the forward strand. In the next step, we generate each read's orientation and read 
sequence, which is similar to the "natural" model. Lastly, the assembly is constructed based on T 
and the transcript sequences. 
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Represent T by contig positions. We can further decompose T as 

T={C,V}, C = {n tj ] a ndV = {t(-),pos(t(-),-)}. 

where is the number of reads generated at the jth position of the ith contig in the assembly. 
t(i) maps the ith contig to the index of its parent transcript. pos(t(i),j) maps position j in contig 
i to the corresponding position in transcript t(i). Then we have 

n ij = n t(i),pos(t(i),j)i and X i = &(<)■ 

In addition, C collects all positions in the transcript set that contain a positive number of reads, 
i.e., riij > 0. 



Derivation of the likelihood correction term. The probability of generating an assembly 
count vector, C, given the assembly, the assembly expected read coverage, the true transcripts, the 
true transcript expected read coverage and the location of each contig in the assembly (V) is 

mv,A,A,T,z) = P f^miy (16) 

P(C,V,A,A\T,~) (17) 



Ec>P(C',P,A,A\T,Ey 

llt=lllj=0 n l3 \ e 



provided that P(V, A, A\T, S) > 0. 



For all T' = {C , P} compatible with A, T' must share the same set of positions that do not generate 
any read with T ■ As a consequence, the probabilities at positions generating no reads are the same 
for both the numerator and denominator of (17). Therefore, these probabilities cancel out and we 
obtain (18). 

If we define 

M ii—L .nij 
i=l j=0 13 ' 

then 

P(D\A,A) = £P(£>|C)P(C|A,A), (19) 
c 

= ^P(D\C) P(C,P,T,~\A,A), (20) 

C V,T,S 

= Y,P(D\C) P(V,T,E\A,A)P(C\P,A,A,T,E), (21) 

C V,T,S 

= E P ^' C V T7\ E PCP,T,S\A,A), (22) 



UZi^-px^fxM)' 



(23) 
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(22) follows from (21) for the reason that P(P, T, E\A, A) > 0 P(V,A,A\T,E) > 0. The denom- 
inator in (23) is the likelihood correction term. 



Calculation of the numerator. The numerator in (23), Ec P(D\C)g(C, A), assumes that reads 
can only be generated from transcript positions that are part of a contig. This is roughly equivalent 
to calculating the likelihood only from positions within a contig under the "natural" model. Suppose 
that the nth read comes from contig i. Although we cannot determine G n and S n for this read in 
the "natural" model, we still know that 



P(G n , S n ) — 



N' 



Therefore we can calculate the equivalent part of P(D\C)g(C, A) in the "natural" model. 

Lastly, in RSEM-EVAL, we use A^ instead of A; in the calculation. The relationship between A^ and 
Aj is 

A' = — A 

1 EXoA^-L + l) *' 



5 Contig impact score 



In this section, we will derive RSEM-EVAL's contig impact score. 



5.1 Decomposition of the RSEM log likelihood term 



First, let us look at log Prsem(D\T = A,Q MLE ), the log transform of the numerator of (15). 
For simplicity, we assume that this term follows the basic RSEM model [3] and denote by Z n ijk 
the indicator random variable sum ni arizing the hidden information for read tz, where Z n {j]^ — i if 
(G n , S n , O n ) = k). We use Z to summarize all indicator variables, Z = {Z n ijk}. We reorganize 
log Prsem(D\T = A, Q MLE ) as 



N 



\0g P RS EM (D\T = A, Q C MLE ) = Yj l0gPRSEM ^ e MLE) 



n=l 
N 



= y^y^ PRS EM (Znijk = l\®MLE> r n) log 



PRSEM(Z n jjk = l,r n \G c MLE ) 

PnsEM(Z nijk = l\Q c MLE ,r n y 



n=l i,j,k 
N 

^ ^ PRSEM{Z n jjk = l\Q C MLEi r n)logPRSEM{Z n ijk = l,r n \@ MLE ) 
n=l i,j,k 



N 



^ ^ PRSEM(Z n jjk = l\ ( 3>MLEi r n)logPRSEM(Z ni j k = l\Q MLE ,r n ), 
n=l i,j,k 



M 



= ^liki-H(Z\D,Q c MLE ), 



i=0 
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where liki is the expected complete likelihood for contig i, 

Uki = ^2 PRSEM{Znijk = M@M LE' r n) log PRSEM(Znijk = ^,r n \Q c MLE ), 

n,j,k 

and H(Z\D, @ C MLE ) is the posterior entropy of Z. 

5.2 Decomposition of the RSEM-EVAL score 

Because the assembly prior, BIC, and likelihood correction terms can also be decomposed into contig 
components, we can rewrite log P(A, D) as 

1 M 

log P(A, D) = (liko - - log AT) + score, - H(Z\D, Q C MLE ), 

i=i 

where scorei is defined as 

score, = log P(£i,Si\X M LE,i) + Uh - \ogP(d = l|^,Sj,A-) - - log N. 

5.3 RSEM-EVAL's contig impact score 

We denote the contig impact score for contig % as bi. It is defined as the log of the ratio between two 
hypotheses. The first hypothesis is that contig i is real. The second (null) hypothesis is that the 
reads composing contig i are actually from the background noise (i.e., contig i is not real). In order 
to avoid the expected reads from contig i being assigned to other isoforms, in the null hypothesis, 
we fix the posterior probability P(Z\D, Q C MLE ) and only replace liki in logP(^4, D) by lik { , 

li h = ^2 P RSEM(Z n ijk = M@M LE, r n) log P RS EM(Z n 0 = 1, r n \Q C MLE ), 
n,j,k 

where Z n Q = 1 means that read n is from the background noise. 
Thus the contig impact score, bi, for contig i, becomes 



h = log P{£i,Si\\ M LE,i)- log P(Ci = l\£i, Si , \'i)- -log N 

-l^p (7 -il« c r m~~ P RSEM{Z n jjk = l,r n \Q c MLE ) 

+ I, PRSEM{Znijk - 1\VmLE, r n) log — — — -. 

PRSEM(Z n 0 = l,r n \0 MLE ) 

6 Definition of contig precision, recall, and F\ 

In this subsection, we provide detailed definitions of the contig precision, recall, and F\. Throughout, 
A denotes the assembly, and B denotes the reference. Both A and B are thought of as sets of 
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sequences. As discussed in the main text, the reference can be either an estimate of the "true" 
assembly or a collection of full-length reference transcripts. In this paper, we have used an estimate 
of the "true" assembly as the reference. 

First, the contig recall is defined as follows: 

• Align the assembly A to the reference B, using Blat. We use default settings for Blat, except 
that we require 80% identity (under Blat's definition of percent identity), instead of the default 
90% identity, in order to generate more candidate alignments. 

• Throw out alignments that are to the reverse strand, if in strand-specific mode. 

• Throw out alignments whose fraction identity (defined below) is less than a parameter, the 
minimum fraction identity (0.99 in this paper). 

• Throw out alignments whose fraction indel (defined below) is greater than a parameter, the 
maximum fraction indel (0.01 in this paper). 

• Construct a bipartite graph from the remaining alignments, in which there is an edge between 
a £ A and b £ B iff there is a remaining alignment I of a to b. 

• The contig recall is the number of edges in the maximum cardinality matching of this graph, 
divided by the number of sequences in the reference B. 

The contig precision is defined as follows: interchange the assembly and the reference, and compute 
the contig recall. The contig F\ is the harmonic mean of the precision and recall. 

The fraction identity of an alignment I from a to b is defined as min(x/y, x/z), where 

• x is the number of bases in a that are aligned to an identical base in b, according to I, 

• y is the number of bases in a, and 

• z is the number of bases in b. 

The fraction indel of an alignment I from a to b is defined as max(w/y,x/z), where 

• w is the number of bases that are inserted in a, according to / (Blat's "Q gap bases"), 

• x is the number of bases that are inserted in b, according to I (Blat's "T gap bases"), 

• y is the number of bases in a, and 

• z is the number of bases in b. 



12 



7 Definition of nucleotide precision, recall, and F\ 



In this subsection, we provide detailed definitions of the nucleotide precision, recall, and F±. The 
assembly A and reference B are as described in the previous subsection. 

First, the nucleotide recall is defined as follows: 

• Align the assembly A to the reference B, using Blat. We use default settings for Blat, except 
that we require 80% identity (under Blat's definition of percent identity), instead of the default 
90% identity, in order to generate more candidate alignments. 

• Throw out alignments that are to the reverse strand, if in strand-specific mode. 

• Throw out alignments that are shorter than a parameter, the minimum fragment length. (In 
this paper, we have used the read length as the minimum fragment length.) 

• Add each remaining alignment to a priority queue, with priority equal to the number of 
identical bases in the alignment. 

• Let numer = 0. 

• While the priority queue is not empty: 

— Pop the alignment / with highest priority. 

— Add the number of identical bases in the alignment to numer. 

— Subtract I from all the other alignments in the queue and update their priorities (see 
below) . 

• Let denom be the total number of bases in the reference B. 

• The unweighted nucleotide recall is numer/denom. 

The actual implementation uses a more complicated and efficient algorithm than the one above. 

The nucleotide precision is defined as follows: interchange the assembly and the reference, and 
compute the nucleotide recall. The nucleotide F\ is the harmonic mean of the precision and recall. 

Alignment subtraction, used in the definition of the nucleotide recall, is defined as follows: 

• An alignment I from a to b can be thought of as a set of pairs of disjoint intervals 

{{[si(a), ei(a)], [si(b), ei(b)]), ([s n (a), e n (a)], [s n (b), e n (b)])}, 

where each pair ([sj(a), ej(a)], [si(b), ej(6)]) corresponds to an ungapped segment of the align- 
ment: Si(a) and e»(a) are the segment's start and end positions within a, and Sj(6) and ei(b) are 
the segment's start and end positions within b. In the case of non-strand-specific alignments, 
Si(b) might be greater than ej(fe). 

• If I is an alignment from a to b, I' is an alignment from a' to b' , a ^ a', and b ^ b' , then the 
difference I — I' = I. 
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• If I is an alignment from a to b, I' is an alignment from a' to b' , a = a', and b ^ b' , then 
the difference I — I' = I" , defined as follows. Each alignment segment of I is compared to 
the alignment segments of I'. If a segment of / overlaps one of the segments of I' wrt a, 
it is truncated so as to avoid the overlap. This truncation may result in zero, one, or two 
replacement alignment segments. (If the overlapping alignment segment of V is contained 
strictly within the segment of /, wrt a, two segments will result.) 

• If I is an alignment from a to b, I' is an alignment from a' to 6', a ^ a', and 6 = 6', then the 
difference I — V = I" , defined similarly as in the previous item, except overlaps are examined 
and resolved wrt b. 

8 The relationship between the KC and RSEM-EVAL scores 

In this section we explain the mathematical relationship between the KC and RSEM-EVAL scores. 
To do so, we first present another reference-based measure, the expected description length, which 
has an even closer relationship to the RSEM-EVAL score. We then show how maximizing the KC 
score is equivalent to maximizing a simplified version of the expected description length measure. 

8.1 Expected description length as a referenced-based measure 

A transcriptome is defined as the set of transcripts and their abundances. Therefore it is natural 
to define the ground truth as a set of true transcripts and their relative abundances. We denote 
the ground truth as (T, r), where T is the set of transcripts, r represents each transcript's relative 
abundance. Given a k-mer size k, the ground truth induces a distribution over all possible /c-mers. 
We denote this distribution as p(x). 

Having p(x), we can generate any amount of /c-mers from the ground truth. Let D denote the set 
of /c-mers generated. For simplicity, we assume that the throughput (number of bases generated) is 
fixed (e.g., \D\ is fixed). Then the number of reads N = 

Suppose we want to select a model from a family of models having the form (A,f), where A is a 
set of sequences and f is a set of associated relative abundances. Because A is an assembly, we 
require that every sequence's relative abundance is positive. In addition, we also introduce a to in f, 
which represents the probability of generating /c-mers from the background noise. The probability 
of generating any fc-mer is the same under the background noise. Thus |r | = M + 1, where M is 
the number of sequences in A. Each (A, f) also induces a distribution over /c-mers and we denote 
it as q(x). 

Given any instance of D, we can evaluate each candidate model (A,f) by its description length [4]. 
The description length of a model is the number of bits required to compress both the data D and 
the model itself (A, f ). We assume the number of sequences, M, follows distribution P(M) and the 
length of a sequence I follows a distribution of P(l). 

To encode a model, we need to encode the number of sequences, M, with — log 2 P(M) bits. Then 
we need to encode the length of each sequence with a total of — 1°§2 bits. After that, 
we need — log 2 \\A\ = 2\A\ bits to encode all of the nucleotides in A, where \A\ refers to the total 
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number of bases in A. Lastly, we need at least log 2 N bits to encode f [5] since we have M free 
parameters. Given the model, we use — log 2 q(x) bits to encode a /c-mer x. Thus the description 
length of (A, f) is 



f DL (A, r)=(j2~ lo S2 Q(x)) + (~ l°g 2 P(M) - £ log 2 P(k) + 2| A\ + ^ log 2 i\A . (24) 

s v ' S v ' 

bits to encode the data bits to encode the model 

Finally, our new reference-based measure is defined by taking expectation over D: 

/edl(A, f) = E[f DL (A, t)]=N- H(p, q) + 2\A\ + g(A, f), (25) 
where H(p, q) is the cross entropy and defined as 

H (p, q) = log 2 9(x )> 



and g(A, f ) is defined as 



M M 
g(A, f) = - log 2 P(M) - log 2 P(U) + -7T log 2 N. 



i=i 



8.2 Interpretation of the RSEM-EVAL score as a description length 

We can write — log P(A, D) as (see equation 2 in the main text) 

-\ogP{A,D) « - \ogP{A\K ML E) - \ogP(D\A,A MLE ) + \{M + 1) log A . 

V «, ' V «, ' J , 

assembly prior likelihood _ T „ V ' 

■ y v BIC term 

In the above equation, the assembly prior term encodes the assembly A, the BIC term encodes the 
parameters Amle and the likelihood term encode the data given the model. Thus we can interpret 
the RSEM-EVAL score as a description length in a better designed system (e.g., we require that 
the data covers the assembly). 

8.3 From expected description length to KC score 

Let us focus on A ■ H(p, q): 

N-H(p,q) = ^ Y | - log 2 Q J \-N Y p(x)log 2 q(x) 

\D\ 
~k 



Y p(x)2k-N Y P( x ) l °E2Q( x ) 



x&TAxfA x^TWxeA 

2 \ D \ Y P( x ) ~ N Y P( x ) lo &2Q( x ) 

xeT/\x£A xgTVxeA 

2\D\(1- P( X ))~ N E P(x)log 2 q(x), 

xeTAxeA xgTVxeA 
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where x G A means that the /c-mer x is present in the sequences of A (not from background noise) . 

In the real size RNA-Seq data sets we tested, it appears that 2|D|(1 — ^2 xe xAxeAP( x )) dominates 
N-H(p, q) and 2\A\ dominates the bits used to encode the model in the expected description length. 
Therefore we have 

fEDL(A,T) « 2|D|(1- P( X )) + 2 \ A \ 

xGTAxGA 

= 2\D\(l-( £ p(i)-& 

x^TAxtA ' ' 

= 2\D\(1 - score K c(^))- 
Thus maximizing KC score is roughly equivalent to minimizing the expected description length. 

9 Experiment details for the axolotl assembly 

To generate Fig. 6 in the main text, we used the following procedure to build a one-to-one mapping 
between the axolotl contigs and frog protein sequences. First, we aligned the axolotl contigs to the 
frog protein sequences using BLASTX and kept only those alignments with e-value < le-5. Then 
a one-to-one mapping was determined using reciprocal alignments. Reciprocal alignments were 
defined as those alignments that are best with respect to both the axolotl contig and frog protein 
sequence. Here, best means an alignment has the largest number of axolotl bases aligned to a frog 
protein and vice versa. 
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Supplementary tables 





Nucleotide-level F\ 


Contig-level F\ 


Novel measure 


Before 


0.92 


0.70 


0.98 


After 


0.92 


0.70 


0.98 



Table SI: Spearman's rank correlation coefficients between the RSEM-EVAL score w/o the tran- 
script length distribution related term and reference-based measures. No difference is observed. 





Number of contigs 


Median difference in 
Nucleotide-level F\ Contig-level F\ 


KC score 


RSEM-EVAL score 


Trinity 


142 


2.76e-4 


2.81e-5 


1.13e-4 


291234.82 


Oases 


22043 


1.19e-l 


2.80e-3 


1.53e-2 


38941405.71 


SOAPdenovo- Trans 


703 


-3.59e-4 


4.06e-5 


6.83e-7 


39309.76 


Trans- ABySS 


142491 


7.10e-2 


6.95e-3 


8.31e-3 


21906624.47 



Table S2: Effect of trimming contigs with negative contig score for assemblies. Median difference 
is the median of the difference of the trimmed measure and the untrimmed measure. Bold values 
indicate that the estimates are significantly (P < 0.05) more accurate, as assessed by a paired 
Wilcoxon signed rank test. The median values for Trans-ABySS are not significant because there 
is only one Trans-ABySS assembly. The median numbers of contigs that are trimmed is largest 
for Oases and Trans-ABySS assemblies. The median improvement of the trimmed assemblies over 
untrimmed assemblies is also largest, for all evaluation measures, for these two assemblers. 





[6] assembly 


RSEM-EVAL- guided assembly 


Number of contigs 


113,925 


173,130 


Total length of assembly 


71,027,573 


121,949,539 


Minimum contig length 


40 


201 


Maximum contig length 


7,943 


18,756 


Mean contig length 


623 


704 


First quartile 


423 


254 


Second quartile (median) 


500 


366 


Third quartile 


700 


720 


95th Percentile 


1,356 


2,456 


Standard Deviation 


414 


887 


N50 


650 


1,200 



Table S3: Comparison of the assembly statistics between the published axolotl assembly and the 
new RSEM-EVAL-guided assembly. 
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Supplementary figures 




(a) (b) 

Figure SI: Generative probabilistic models for RSEM-EVAL. (a) The "natural" model, which 
represents a natural definition of the process of generating both the RNA-Seq reads and their 
"true" assembly. In this model, we first generate the true transcript sequences and their relative 
expression levels. Then we use the RSEM model [3, 2] to generate an RNA-Seq data set. Lastly, the 
"true" assembly is defined based on the generated transcript sequences and hidden information from 
RNA-Seq reads, (b) The model used by RSEM-EVAL, which approximates the "natural" model 
to enable more efficient inference. In this model, we first generate an assembly and then generate 
a set of RNA-Seq reads given the assembly. To generate an assembly, each contig is assumed to be 
generated independently. To generate a contig m, its sequence length L rn is first picked and then 
its sequence A m is generated given its length. 
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Figure S2: RSEM-EVAL scores of "true" assemblies for the local regions around the maximal values 
on both simulated and real data sets. The maximum values (red circles) are achieved at w = 0 and 
w = 2 in a left-right order. 
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Figure S3: RSEM-EVAL scores with overlap length parameter set as 50 (top row) and ML (bottom 
row) scores of "true" assemblies for different values of the minimum overlap length on both simulated 
(left column) and real (right column) data sets. The maximum values (red circles) are achieved at 
w = 1, w = 2, w = 75 and w = 75 in a top-down, left-right order. For better visualization of the 
maximizing values of w, RSEM-EVAL scores for the local regions around the maximal values are 
shown in Figure S4. 
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Figure S4: RSEM-EVAL scores (with overlap length parameter set as 50) of "true" assemblies for 
the local regions around the maximal values on both simulated and real data sets. The maximum 
values (red circles) are achieved at w = 1 and w = 2 in a left-right order. 
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Figure S5: Correlation of the RSEM-EVAL score with reference-based measures on the strand- 
specific data sets [1]. Scatterplots are shown for the real mouse (top row) and real yeast (bottom 
row) data sets and for both the nucleotide-level F\ (left column) and contig-level F\ (center column) 
measures. For comparison, scatterplots of the nucleotide-level F± against the contig-level Fx are 
shown (right column). Because the versions of SOAPdenovo- Trans and Trans- ABySS we used did 
not take into account strand-specificity, they were not run on the two strand-specific data sets. The 
Spearman rank correlation coefficient (bottom-right corner of each plot) was computed for each 
combination of data set and reference-based measure. 
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Figure S6: Correlation of the N50 score with reference-based measures on the strand non-specific 
data sets. Scatterplots are shown for the simulated (top row) and real mouse (bottom row) data 
sets and for both the nucleotide-level F\ (left column) and contig-level F\ (right column) measures. 
The Spearman rank correlation coefficient (bottom-right corner of each plot) was computed for each 
combination of data set and reference-based measure. 
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Figure S7: Correlation of the N50 score with reference-based measures on the strand-specific data 
sets. Scatterplots are shown for the real mouse (top row) and real yeast (bottom row) data sets 
and for both the nucleotide-level F\ (left column) and contig-level F\ (right column) measures. The 
Spearman rank correlation coefficient (bottom-right corner of each plot) was computed for each 
combination of data set and reference-based measure. 
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Figure S8: Correlation of the RSEM-EVAL and KC scores on the strand-specific data sets. The 
Spearman rank correlation coefficient (bottom-right corner of each plot) was computed for each 
data set. 
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Figure S9: Correlation of the N50 and KC scores on the strand non-specific data sets. The Spearman 
rank correlation coefficient (bottom-right corner of each plot) was computed for each data set. 
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Figure S10: Correlation of the N50 and KC scores on the strand-specific data sets. The Spearman 
rank correlation coefficient (bottom-right corner of each plot) was computed for each data set. 
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Figure Sll: Correlation of the RSEM-EVAL score and KC score with k = 38 on the strand non- 
specific data sets. The Spearman rank correlation coefficient (bottom-right corner of each plot) was 
computed for each data set. 
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Figure S12: Correlation of the RSEM-EVAL score and KC score with k = 152 on the strand non- 
specific data sets. The Spearman rank correlation coefficient (bottom-right corner of each plot) was 
computed for each data set. 
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Figure S13: Within-assembler correlation of the RSEM-EVAL and KC scores on the strand-specific 
data sets. Scatterplots are shown for the real mouse (top row) and real yeast (bottom row) data 
sets and for the Trinity (left column) and Oases (right column) assemblers. The Spearman rank 
correlation coefficient (bottom-right corner of each plot) was computed for each combination of data 
set and assembler. 
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Figure S14: Scatterplots of evaluation measures between the untrimmed assemblies and trimmed 
assemblies. Measures of the trimmed assemblies were plotted against those for the untrimmed 
assemblies. The diagonal lines, y = x, are shown for easy visualization. The largest improvements 
were seen for assemblies produced by Oases and Trans- ABySS. For both the nucleotide- and contig- 
level F\ scores, the trimmed Oases assemblies were the most accurate of all assemblies (both trimmed 
and untrimmed). 
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Figure S15: The number of genes expressed (TPM > 1) (top) and the number of DE UP genes 
(bottom) at each time point in the RSEM-EVAL guided (black) and [6] (red) assemblies. 



31 



